The vast majority of these notes are derived from Casella and Berger, and from Dr. Chandola's notebooks.
You should also treat these as a complement to the Seeing Theory suggested readings that we will also discuss in class.
Sample space $S$: the set of all possible outcomes in an experiment
Event: A collection of possible outcomes (a set of possible outcomes)
Sigma algebra $\mathcal{B}$: essentially, all events that can be given a probability
Axioms of probability:
Assume there is a set of nonnegative numbers $p_1, \ldots, p_n$, one for each outcome $s_i \in S$, that sum to one. If: $$P(A) = \sum_{\{i: s_i \in A\}} p_i$$
Then $P$ is a probability function (ignoring the stuff with $\mathcal{B}$ for now).
Exercise: Prove that if $P(\{H\}) = P(\{T\})$, then $P(\{H\}) = P(\{T\}) = \frac{1}{2}$
If $A$ and $B$ are events in $S$, and $P(B) \gt 0$, then the conditional probability of A given B, written as $P(A|B)$, is: $$P(A|B) = \frac{P(A \cap B)}{P(B)}$$ or equivalently, as $$P(A|B) = \frac{P(A = a, B =b)}{P(B=b)}$$ The latter notation will be useful below
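For instance (an illustrative sketch with arbitrarily chosen events, not from the text): roll a fair six-sided die, let $A$ be "the roll is even" and $B$ be "the roll is greater than 3". Then $P(A|B) = P(A \cap B)/P(B) = (2/6)/(3/6) = 2/3$, which a quick simulation confirms:
import numpy as np

# simulate many rolls of a fair six-sided die
rolls = np.random.randint(1, 7, size=100_000)
A = (rolls % 2 == 0)   # event A: the roll is even
B = (rolls > 3)        # event B: the roll is greater than 3
# empirical P(A|B) = P(A and B) / P(B); should be close to 2/3
print((A & B).mean() / B.mean())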
Exercise Free sandwich to students. (Pg. 21-22)
Two events $A$ and $B$ are statistically independent if $P(A \cap B) = P(A)P(B)$. Naturally, if $P(A \cap B) \neq P(A)P(B)$, then the events are not independent. Machine learning is largely about leveraging these conditional relationships between variables.
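A quick simulated sketch of this definition (illustrative only), using two independent fair coin flips:
import numpy as np

# two independent fair coin flips; A = first flip is heads, B = second flip is heads
flips = np.random.randint(0, 2, size=(100_000, 2))
A = flips[:, 0] == 1
B = flips[:, 1] == 1
# for independent events, P(A and B) should be close to P(A) * P(B)
print((A & B).mean(), A.mean() * B.mean())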
A random variable is a function from a sample space $S$ into the real numbers.
Exercise: What is a random variable we could compute with:
Exercise Define the random variable $X$ to be the number of heads on 3 coin tosses.
The cumulative distribution function, or cdf, of a random variable $X$ is defined as: $$F_X(x) = P_X(X\leq x),\mathrm{\, for\, all\,} x$$
Exercise: Draw the $cdf$ of $X$ when $X$ is the number of heads observed after tossing 3 fair coins.
The probability mass function (pmf) of a discrete random variable $X$ is given by: $$f_X(x) = P(X=x),\mathrm{\, for\, all\,} x$$
Exercise: Why is this only for discrete random variables?
Note, for a discrete (integer-valued) random variable: $$ P(X \leq b) = \sum_{k \leq b} f_X (k) = F_X(b)$$
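A quick check of this identity for the running example, the number of heads in 3 fair tosses (a sketch relying on scipy's binom, since that count is Binomial($3, 0.5$)):
from scipy.stats import binom

# X = number of heads in 3 fair coin tosses ~ Binomial(n=3, p=0.5)
X = binom(3, 0.5)
for b in range(4):
    # the cdf F_X(b) should equal the sum of the pmf over k <= b
    print(b, X.pmf(range(b + 1)).sum(), X.cdf(b))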
The probability density function (pdf) of a continuous random variable $X$ is the function that satisfies: $$F_X(x) = \int_{-\infty}^x f_X(t) dt$$
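To see this relationship concretely (an illustrative sketch using a normal distribution and numerical integration; the value of $x$ is arbitrary):
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

# numerically integrate the standard normal pdf from -infinity up to x
x = 1.3
area, _ = quad(norm.pdf, -np.inf, x)
# the integral of the pdf should match the cdf F_X(x)
print(area, norm.cdf(x))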
A random variable $X$ is continuous if $F_X(x)$ is a continuous function of $x$. It is discrete if $F_X(x)$ is a step function of $x$.
Exercise: What are some examples of continuous variables? Of discrete ones?
Two random variables $X$ and $Y$ are identically distributed if $F_X(x) = F_Y(x)$ for all $x$
Exercise Define two identically distributed random variables that can be computed from a single coin tossed three times.
First, a note for this section: If $X$ is a RV, then any function of $X$, say $g(X)$, is also a random variable. For generality, we're going to use $g(X)$ instead of just $X$ here.
The expected value, or expectation, of a random variable is simply the variable's average value, weighted by the probability of that value occurring.
Exercise: What is the expected value of the random variable $X$, the number of heads I get when I flip a coin 100 times?
Formally, if $X$ is continuous, then:
$$ \mathbb{E}\, g(X) = \int_{-\infty}^{\infty} g(x) f_X(x)\, dx$$
And if $X$ is discrete, then:
$$ \mathbb{E}\, g(X) = \sum_{x \in \mathcal{X}} g(x)P(X=x)$$
A useful property of expectations is linearity: if $a$ and $b$ are constants, then $\mathbb{E}(aX + b) = a\,\mathbb{E}(X) + b$.
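A small simulation sketch of linearity (the random variable and the constants here are arbitrary choices):
import numpy as np

# X = number of heads in 3 fair coin tosses, simulated many times
x = np.random.binomial(3, 0.5, size=100_000)
a, b = 2.0, 5.0
# E(aX + b) should be close to a*E(X) + b
print((a * x + b).mean(), a * x.mean() + b)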
The variance of a random variable $X$ is defined as: $$ \mathrm{Var}(X) = \mathbb{E}\left[(X - \mathbb{E}(X))^2\right] $$
The standard deviation is equal to $\sqrt{\mathrm{Var}(X)}$.
Both of these measures give you an idea of the spread of the distribution. That is, the larger the variance, the larger the ... variability ... in $X$.
Exercise What does it mean if $\mathrm{Var}(X) = 0$?
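A quick simulation sketch of the variance definition (illustrative values, reusing the 3-coin example):
import numpy as np

x = np.random.binomial(3, 0.5, size=100_000)
# variance as the average squared deviation from the mean, vs. numpy's built-in
print(((x - x.mean()) ** 2).mean(), x.var())
# the standard deviation is the square root of the variance
print(np.sqrt(x.var()), x.std())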
A distribution is a simplified model of a population. Because it is useful to have the same model be able to study different populations, statistical distributions are families, rather than a single pre-specified form. A specific instantiation of that family is determined by the value of the parameter(s) for that distribution.
from scipy.stats import bernoulli, poisson, binom, norm, mvn, hypergeom
import numpy as np
from matplotlib import pyplot as plt
import matplotlib
import pandas as pd
import plotly.express as px
headimg = plt.imread('data/quarterheads.jpg')
tailimg = plt.imread('data/quartertails.jpg')
A random variable is generated from a probability distribution. There are different types of distributions defined for discrete random variables. These include:
The Bernoulli distribution represents a binary random variable which takes value 1 with success probability $p$ and value 0 with failure probability $q=1-p$. A Bernoulli distribution has only one parameter, $p$ (written as $\theta$ in the code below).
theta = 0.5
# let us draw a sample from a bernoulli distribution
b = bernoulli.rvs(theta,size=1)
print(b)
if b[0] == 0:
plt.imshow(tailimg)
plt.axis('off')
else:
plt.imshow(headimg)
plt.axis('off')
[0]
# you can also draw samples simultaneously
theta = 0.9
samples = bernoulli.rvs(theta,size=1000)
print(samples)
# count the number of successes (sample = 1). What happens when you change p?
print(np.count_nonzero(samples==1))
[1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 0 0 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 1 0 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 0 0 1 1 0 1 0 1 0 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1 0 1 1 1 1 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 1 0 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 0 0 1 0 0 1 1 1] 889
# evaluate the Bernoulli pmf at both possible outcomes (0 and 1) for theta = 0.7
a = np.arange(2)
theta = 0.7
bernoulli.pmf(a, theta)
array([0.3, 0.7])
# plotting the probability mass function for the Bernoulli distribution
a = np.arange(2) # domain of the bernoulli variable
THETAS = [0.1,0.2,0.6,0.8] # different parameters
df = pd.DataFrame([bernoulli.pmf(a, theta) for theta in THETAS], columns=['tails','heads'])
df = df.assign(theta=THETAS)
melted_df = pd.melt(df, id_vars="theta")
melted_df
|   | theta | variable | value |
|---|-------|----------|-------|
| 0 | 0.1   | tails    | 0.9   |
| 1 | 0.2   | tails    | 0.8   |
| 2 | 0.6   | tails    | 0.4   |
| 3 | 0.8   | tails    | 0.2   |
| 4 | 0.1   | heads    | 0.1   |
| 5 | 0.2   | heads    | 0.2   |
| 6 | 0.6   | heads    | 0.6   |
| 7 | 0.8   | heads    | 0.8   |
fig = px.bar(melted_df, x="variable", y="value",
color="theta",
facet_col="theta",
title = "Bernoulli Probability",
labels = {"variable" : "Heads or Tails<br>(0 or 1)",
"value" : "Probability"},
height=400,
width=800)
fig.show()
colors = ['r','g','y','b']
plt.figure(figsize=(12,5))
for i, theta in enumerate([0.1, 0.2, 0.6, 0.7]):
ax = plt.subplot(1, 4, i+1)
plt.bar(a, bernoulli.pmf(a, theta), label=theta, color=colors[i], alpha=0.2)
ax.xaxis.set_ticks(a)
plt.legend(loc=0)
if i == 0:
plt.ylabel("PDF at $k$")
plt.suptitle("Bernoulli probability")
Another popular distribution for a discrete random variable is the binomial distribution. A binomial distribution has two parameters $n$ and $\theta$, where $0 \le \theta \le 1$. The sample generated by a binomial distribution denotes the number of successes observed in a sequence of $n$ binary trials (e.g., toss of a coin) when the probability of each success is $\theta$.
The samples that are drawn from a binomial distribution range between 0 and $n$.
The probability distribution is defined as: \begin{equation} p(k;n,\theta) = P(X = k) = \binom{n}{k}\theta^k (1 - \theta)^{n-k} \end{equation}
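A sketch checking this formula against scipy's binom.pmf (the binomial coefficient comes from scipy.special.comb; the values of $n$, $\theta$, and $k$ are arbitrary):
from scipy.special import comb
from scipy.stats import binom

n, theta, k = 20, 0.9, 17
# pmf computed directly from the formula, vs. scipy's implementation
print(comb(n, k) * theta**k * (1 - theta)**(n - k), binom.pmf(k, n, theta))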
# sampling from a binomial distribution: n = 20 trials, success probability 0.9
sample = binom.rvs(20, 0.9)  # a single draw; pass size=k to draw k samples at once
print(sample)
17
# plotting the pmf for a binomial distribution
plt.figure(figsize=(12,6))
k = np.arange(0, 22)
for p, color in zip([0.1, 0.3, 0.6, 0.8], colors):
rv = binom(20, p)
plt.plot(k, rv.pmf(k), lw=2, color=color, label=p)
plt.fill_between(k, rv.pmf(k), color=color, alpha=0.3)
plt.legend()
plt.title("Binomial distribution")
plt.tight_layout()
plt.ylabel("PDF at $k$")
plt.xlabel("$k$")
The Poisson distribution is typically used to model counts. Its domain ($\mathcal{X}$) is $\{0, 1, 2, \ldots\}$. It has one parameter, $\lambda$.
The probability mass function is given as: $P(X = k) = \frac{\lambda^ke^{-\lambda}}{k!}$.
The expected value, or mean, of $X$ is $\lambda$; its variance is also $\lambda$.
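A sketch checking the pmf formula against scipy's poisson.pmf (arbitrary values of $\lambda$ and $k$; the factorial comes from scipy.special):
import numpy as np
from scipy.special import factorial
from scipy.stats import poisson

lam, k = 10, 7
# pmf computed directly from the formula, vs. scipy's implementation
print(lam**k * np.exp(-lam) / factorial(k), poisson.pmf(k, lam))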
all_dfs = []
for lambd in [10, 25,50, 100]:
rv = poisson(lambd)
# calculate the pmf for different values of k and plot
k = np.arange(200)
all_dfs.append(pd.DataFrame(zip(k, rv.pmf(k), [lambd]*len(k)),
columns=['count','probability','lambda'])
)
df = pd.concat(all_dfs)
fig = px.bar(df, x="count",y="probability",facet_col='lambda',
facet_col_wrap=2,
height=600, width=800)
fig.for_each_annotation(lambda a: a.update(text=a.text.split("=")[-1]))
fig.show()
You see that the Poisson is a bit weird: the variance and the mean are encoded in a single parameter! The negative binomial distribution is very similar (with almost the same generative story), except that it allows the mean and the variance of the distribution to differ.
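A sketch of this contrast (using scipy's nbinom with arbitrary parameters; in scipy's parameterization the negative binomial has mean $n(1-p)/p$ and variance $n(1-p)/p^2$):
from scipy.stats import nbinom, poisson

# Poisson: the mean and the variance are both lambda
print(poisson.stats(25, moments='mv'))
# negative binomial: the variance is strictly larger than the mean
print(nbinom.stats(10, 0.4, moments='mv'))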
# generating samples from this distribution (rv is the last Poisson from the loop above, with lambda = 100)
samples = rv.rvs(1000)
h = plt.hist(samples,bins=20,density=True)
plt.xlim([0,100])
The hypergeometric distribution models the number of successes when sampling without replacement; here, the number of aces in a 5-card hand dealt from a standard 52-card deck (which contains 4 aces):
[M, n, N] = [52, 4, 5]  # population size, number of aces in the deck, number of cards drawn
rv = hypergeom(M, n, N)
x = np.arange(0, n+1)
pmf_aces = rv.pmf(x)
[f'Prob of {i} aces: {j:.5f}' for i, j in enumerate(pmf_aces)]
['Prob of 0 aces: 0.65884', 'Prob of 1 aces: 0.29947', 'Prob of 2 aces: 0.03993', 'Prob of 3 aces: 0.00174', 'Prob of 4 aces: 0.00002']
A continuous random variable can take an infinite number of possible values. Several interesting distributions exist:
One of the most popular distributions is the Gaussian (normal) distribution. This distribution is defined for any number of variables. For the single-variable case, the distribution is defined using two parameters: $\mu$ and $\sigma$. $\mu$, the mean, can take any value, and $\sigma$, the standard deviation, must be $> 0$.
For a continuous distribution, you cannot compute the probability mass at any value of the random variable. Instead, you can compute the density using the probability density function: $$p(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp[-\frac{1}{2}(\frac{x - \mu}{\sigma})^2]$$ The random variable represented using a Gaussian distribution can take any value from $-\infty$ to $\infty$.
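A sketch checking this density formula against scipy's norm.pdf (arbitrary values of $\mu$, $\sigma$, and $x$):
import numpy as np
from scipy.stats import norm

mu, sigma, x = 125, 8, 130
# density computed directly from the formula, vs. scipy's implementation
print(np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi)),
      norm(mu, sigma).pdf(x))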
# set the parameters
mu = 125
sigma = 8
# draw 1000 samples from this distribution
#samples = norm(mu, sigma).rvs(1000)
# plot an empirical distribution, i.e., a histogram
#h = plt.hist(samples, 30, density=True, alpha=.3)
# Compute the density at several instances of the random variable
x = np.linspace(90, 170, 10001)
# plot the density
plt.plot(x, norm(mu, sigma).pdf(x), linewidth=2)
Consider heights and weights of a population sample
hw = pd.read_csv('data/heightweight.csv')
fig = px.histogram(hw, x="Weight")
fig.show()
A fundamental component of multivariate statistics is an $n$-dimensional random vector, which is just a vector of random variables. It maps from a sample space $S$ into $\mathcal{R}^n$.
Let's simplify this, though, and just assume we have two random variables, $X$ and $Y$.
For a super neat visualization of conditional probability tables, check out the first section of this website.
If $X$ and $Y$ are discrete, their joint probability mass function (joint pmf) is defined as $f(x,y) = P(X=x,Y=y)$. This completely defines the probability distribution of the random vector $(X,Y)$.
A function $f(x,y)$ is called a joint probability density function (joint pdf) if $X$ and $Y$ are continuous and for every $A \subset \mathcal{R}^2$:
$$P((X,Y) \in A) = \iint_A f(x,y)\,dx\,dy$$
Expectations of (functions of) two random variables are computed in the same way as for single random variables.
In the discrete case: $$\mathbb{E}g(X,Y) = \sum_{(x,y)\in \mathcal{R}^2} g(x,y) f(x,y)$$.
Again, assume that we are working only with the random vector $(X, Y)$. The marginal distribution of $X$ in this case is given as:
$$f_X(x) = \sum_{y \in \mathcal{R}} f_{X,Y}(x,y)$$
You marginalize over all values of $Y$, and then get a (well-specified) probability distribution over $X$.
Note that the above is for the discrete case; the analogous definitions, with sums replaced by integrals, hold for continuous random variables.
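A small numerical sketch with a made-up joint pmf over two discrete variables (the table entries are arbitrary, chosen only to be nonnegative and sum to one); it also previews the conditional distribution defined next:
import numpy as np

# joint pmf f(x, y) for X in {0, 1} (rows) and Y in {0, 1, 2} (columns)
f = np.array([[0.10, 0.20, 0.10],
              [0.25, 0.15, 0.20]])
print(f.sum())       # sanity check: the probabilities sum to 1

# marginal of X: sum over all values of y (and analogously for Y)
f_X = f.sum(axis=1)
f_Y = f.sum(axis=0)
print(f_X, f_Y)

# conditional distribution of Y given X = 0: f(y|0) = f(0, y) / f_X(0)
print(f[0] / f_X[0])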
The conditional probability $P(Y|X)$ is (like we saw above): $P(Y=y | X=x) = \frac{P(X=x, Y=y)}{P(X=x)}$.
The conditional distribution of $Y$ given $X = x$ follows similarly:
$$f(y|x) = P(Y=y|X=x) = \frac{f(x,y)}{f_X(x)}$$
Two variables are independent if for all $x \in \mathcal{R}$ and $y \in \mathcal{R}$, $f(x,y) = f_X(x) f_Y(y)$. You can compute the probability of the two observations independently, and then just multiply them together!
When random variables are not independent, we want to know the strength of the relationship between them. One measure of that is the covariance, defined as:
$$\mathrm{Cov}(X,Y) = \mathbb{E}((X-\mu_X)(Y-\mu_Y))$$where $\mu_X = \mathbb{E}(X)$ and $\mu_Y = \mathbb{E}(Y)$.
The correlation between $X$ and $Y$ is defined by:
$$\rho_{XY} = \frac{\mathrm{Cov}(X,Y)}{\sigma_X \sigma_Y} $$where $\sigma_X = \sqrt{\mathrm{Var}(X)}$, and analogously for $Y$.
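An illustrative simulation (with an arbitrary linear construction of $Y$ from $X$ plus noise) showing how np.cov and np.corrcoef estimate these quantities from samples:
import numpy as np

x = np.random.normal(0, 1, size=10_000)
y = 0.9 * x + np.random.normal(0, 0.5, size=10_000)

print(np.cov(x, y))        # 2x2 sample covariance matrix
print(np.corrcoef(x, y))   # the off-diagonal entry is the estimated correlation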
Exercise: What is an example of two quantities that you would expect to see $\rho_{XY} = 0$? What about $\rho_{XY} = .9$? What about $\rho_{XY}= -.9$.
A reminder for later... correlation is not causation, yes, but also, be careful even with correlation!
A multinoulli distribution is a generalization of the Bernoulli distribution to trials that can take one of $k > 2$ possible values. The parameter of the multinoulli distribution is a vector ${\bf \theta}$ with $k$ entries. Each entry $\theta_i$ gives the probability of observing category $i$ in a single trial.
[1/6.]*6  # the parameter vector theta for a fair six-sided die: equal probability for each face
# generate a sample from a multinoulli distribution. This essentially simulates a single roll of a die. Note that the output is a one-hot vector of length $k = 6$
np.random.multinomial(1, [1/6.]*6, size=1)
A multinomial distribution is a generalization of the binomial distribution to trials that can take more than two possible values. The parameters of the multinomial distribution are a vector ${\bf \theta}$ and the number of trials $n$.
# generate samples from a multinomial distribution. Note that the output is a vector of length $k = 6$
np.random.multinomial(20, [1/6.]*6, size=1)
A distribution can be defined for multivariate random variables. One example is the multivariate Gaussian. In general, the random variable is a $D$-length vector ${\bf X}$. The two parameters of this distribution are a mean vector ${\bf \mu}$ and a covariance matrix $\Sigma$. The pdf at any value of ${\bf x}$ is given by: $$ \mathcal{N}({\bf x}| {\bf \mu,\Sigma}) \triangleq \frac{1}{(2\pi)^{D/2}|{\bf \Sigma}|^{1/2}}\exp\left[-\frac{1}{2}{\bf (x-\mu)^\top\Sigma^{-1}(x-\mu)}\right] $$ Note that if $D = 1$, it reduces to a univariate Gaussian distribution.
# define the parameters for D = 2
mu = np.array([10, 10])
Sigma = np.array([[4, 1.], [1., 1.]])
# draw a single sample (a 2-D vector) from this distribution
rv = np.random.multivariate_normal(mu, Sigma)
# sample some points
s = np.random.multivariate_normal(mu, Sigma, 1000)
fig = plt.figure()
plt.subplot(111)
plt.scatter(s[:, 0], s[:, 1])
# add a contour plot of the density over the range of the samples
smin = np.min(s, axis=0)
smax = np.max(s, axis=0)
t1 = np.linspace(smin[0], smax[0], 1000)
t2 = np.linspace(smin[1], smax[1], 1000)
T1, T2 = np.meshgrid(t1, t2)
# evaluate the pdf at each of these mesh points and draw the contours
from scipy.stats import multivariate_normal
Z = multivariate_normal(mu, Sigma).pdf(np.dstack((T1, T2)))
plt.contour(T1, T2, Z)
# the sample covariance matrix should be close to Sigma
np.cov(s.transpose())